Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
Authors
Abstract
A classifier built from a data set with a highly skewed class distribution generally predicts the more frequently occurring classes much more often than the infrequently occurring classes. This is largely because most classifiers are designed to maximize accuracy. In many instances, such as medical diagnosis, this classification behavior is unacceptable because the minority class is the class of primary interest (i.e., it has a much higher misclassification cost than the majority class). In this paper we compare three methods for dealing with data that has a skewed class distribution and nonuniform misclassification costs. The first method incorporates the misclassification costs into the learning algorithm, while the other two methods employ oversampling or undersampling to make the training data more balanced. We empirically compare the effectiveness of these methods in order to determine which produces the best overall classifier, and under what circumstances.
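The three strategies named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's experimental setup: the function names are hypothetical, the sampling is plain random duplication or removal, and the cost-sensitive part is shown as the standard expected-cost decision threshold for a two-class problem (assuming correct predictions cost nothing), which may differ from how the paper's learner uses costs.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority examples until both classes
    have the same size. Assumes len(majority) >= len(minority)."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class that is the same
    size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)), list(minority)


def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold above which predicting the positive
    (minority) class has lower expected cost, assuming correct
    predictions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Total misclassification cost of a set of binary predictions
    (1 = minority/positive class, 0 = majority/negative class)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn
```

For example, with 90 majority and 10 minority examples, oversampling yields 90 of each and undersampling yields 10 of each. With a false negative costing nine times as much as a false positive, the cost-based threshold drops from 0.5 to 0.1. Comparing classifiers by `total_cost` rather than accuracy is what makes the comparison sensitive to the unequal error costs.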
Similar resources
A New Formulation for Cost-Sensitive Two Group Support Vector Machine with Multiple Error Rate
Support vector machine (SVM) is a popular classification technique which classifies data using a max-margin separating hyperplane. The normal vector and bias of this hyperplane are determined by solving a quadratic model, which means that SVM training amounts to solving an optimization problem. Among the extensions of SVM, the cost-sensitive scheme refers to a model with multiple costs which conside...
Experiments with Cost-Sensitive Feature Evaluation
Many machine learning tasks include feature evaluation as one of their important components. This work is concerned with attribute estimation in problems where the class distribution is unbalanced or the misclassification costs are unequal. We test some common attribute evaluation heuristics and propose their cost-sensitive adaptations. The new measures are tested on problems which can reveal the...
Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. W...
Measuring Accuracy between Ensemble Methods: AdaBoost.NC vs. SMOTE.ENN
Imbalanced class distribution is one of the main issues in data mining. This problem exists in multi-class imbalance, where the number of samples in one class is greater or lower than in other classes. Most existing imbalance learning techniques are only designed and tested for two-class scenarios. The new negative correlation learning (NCL) algorithm for classification ensembles, called A...
An Empirical Study of Cost-sensitive Classification in Campaign Management
Extremely unbalanced data and unequal costs are key challenges in data mining for campaign management in CRM. This paper presents an empirical study of cost-sensitive classification in a real-world campaign analysis in the newspaper industry, where the response rate is extremely low while the two types of misclassification costs are very different. Incorporating cost information provided by dom...